Importing Libraries:-

In [1]:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Importing Data Set:-

In [2]:

df = pd.read_csv('companies.csv')
df

Out[2]:

      Company_name                   Description                                        Ratings  Highly_rated_for                                   Critically_rated_for                       Total_reviews  Avg_salary  Interviews_taken  Total_jobs_available  Total_benefits
0     TCS                            IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security, Work Life Balance                    Promotions / Appraisal, Salary & Benefits  73.1k          856.9k      6.1k              847                   11.5k
1     Accenture                      IT Services & Consulting | 1 Lakh+ Employees |...  4.0      Company Culture, Skill Development / Learning,...  NaN                                        46.4k          584.6k      4.3k              9.9k                  7.1k
2     Cognizant                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Skill Development / Learning                       Promotions / Appraisal                     41.7k          561.5k      3.6k              460                   5.8k
3     Wipro                          IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security                                       Promotions / Appraisal, Salary & Benefits  39.2k          427.4k      3.7k              405                   5k
4     Capgemini                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Job Security, Work Life Balance, Skill Develop...  Promotions / Appraisal, Salary & Benefits  34k            414.4k      2.8k              719                   4k
...   ...                            ...                                                ...      ...                                                ...                                        ...            ...         ...               ...                   ...
9995  Techila Global Services        IT Services & Consulting | 501-1k Employees | ...  3.7      Work Life Balance, Salary & Benefits, Company ...  NaN                                        72             454         2                 26                    21
9996  RxLogix Corporation            Pharma | 201-500 Employees | 14 years old | Pr...  2.6      Work Life Balance, Work Satisfaction, Company ...  NaN                                        72             799         15                9                     13
9997  Avians Innovations Technology  Building Material | 51-200 Employees | 17 year...  3.7      Promotions / Appraisal, Work Satisfaction, Sal...  NaN                                        72             489         3                 11                    8
9998  ACPL Systems                   Law Enforcement & Security | 51-200 Employees ...  3.3      Promotions / Appraisal, Salary & Benefits, Wor...  NaN                                        72             520         4                 1                     10
9999  Beroe Inc                      Management Consulting | 201-500 Employees | 19...  4.5      Work Life Balance, Job Security, Company Culture   NaN                                        72             585         7                 5                     14

10000 rows × 10 columns

A Quick Summary of the DataFrame:-

In [3]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Company_name          10000 non-null  object 
 1   Description           10000 non-null  object 
 2   Ratings               10000 non-null  float64
 3   Highly_rated_for      9908 non-null   object 
 4   Critically_rated_for  2807 non-null   object 
 5   Total_reviews         10000 non-null  object 
 6   Avg_salary            10000 non-null  object 
 7   Interviews_taken      10000 non-null  object 
 8   Total_jobs_available  10000 non-null  object 
 9   Total_benefits        10000 non-null  object 
dtypes: float64(1), object(9)
memory usage: 781.4+ KB

In [4]:

df.shape

Out[4]:

(10000, 10)

In [6]:

df.index

Out[6]:

RangeIndex(start=0, stop=10000, step=1)

In [7]:

df.dtypes

Out[7]:

Company_name             object
Description              object
Ratings                 float64
Highly_rated_for         object
Critically_rated_for     object
Total_reviews            object
Avg_salary               object
Interviews_taken         object
Total_jobs_available     object
Total_benefits           object
dtype: object

In [8]:

df.isnull()

Out[8]:

      Company_name  Description  Ratings  Highly_rated_for  Critically_rated_for  Total_reviews  Avg_salary  Interviews_taken  Total_jobs_available  Total_benefits
0     False         False        False    False             False                 False          False       False             False                 False
1     False         False        False    False             True                  False          False       False             False                 False
2     False         False        False    False             False                 False          False       False             False                 False
3     False         False        False    False             False                 False          False       False             False                 False
4     False         False        False    False             False                 False          False       False             False                 False
...   ...           ...          ...      ...               ...                   ...            ...         ...               ...                   ...
9995  False         False        False    False             True                  False          False       False             False                 False
9996  False         False        False    False             True                  False          False       False             False                 False
9997  False         False        False    False             True                  False          False       False             False                 False
9998  False         False        False    False             True                  False          False       False             False                 False
9999  False         False        False    False             True                  False          False       False             False                 False

10000 rows × 10 columns

In [9]:

df.isnull().sum()

Out[9]:

Company_name               0
Description                0
Ratings                    0
Highly_rated_for          92
Critically_rated_for    7193
Total_reviews              0
Avg_salary                 0
Interviews_taken           0
Total_jobs_available       0
Total_benefits             0
dtype: int64

In [10]:

df.isnull().sum().sum()

Out[10]:

7285
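Raw counts are easier to judge as a share of the rows: in the output above, 7,193 of 10,000 rows (about 72%) have no `Critically_rated_for` value. The sketch below shows one way to compute that share per column. It uses a small stand-in frame named `toy` with made-up values, not the real `companies.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the companies data (values are illustrative only).
toy = pd.DataFrame({
    "Highly_rated_for": ["Job Security", np.nan, "Company Culture", "Job Security"],
    "Critically_rated_for": [np.nan, np.nan, "Promotions / Appraisal", np.nan],
    "Ratings": [3.8, 4.0, 3.9, 3.8],
})

# isnull().mean() is the fraction of missing values in each column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

A column that is mostly empty is usually a candidate for dropping or for a dedicated "not available" label rather than for imputation.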

In [11]:

df.describe()

Out[11]:

            Ratings
count  10000.000000
mean       3.894710
std        0.385894
min        1.300000
25%        3.700000
50%        3.900000
75%        4.100000
max        5.000000

Data Cleaning:-

Data cleaning, also known as data cleansing or data wrangling, is a crucial step in the data analytics process. It involves identifying, correcting, and formatting raw data to ensure its accuracy, consistency, and completeness before analysis.

Here's why data cleaning is essential:

Garbage in, garbage out:

Unreliable or inaccurate data leads to misleading and unreliable results. Cleaning ensures the foundation of your analysis is solid.

Improves analysis efficiency:

Clean data allows for smoother and faster analysis, saving you time and effort.

Enables better decision-making:

Accurate insights derived from clean data empower you to make informed and effective decisions.

What does data cleaning involve?

Data cleaning encompasses various tasks, depending on the specific dataset and its quality. Here are some common steps:

Identifying and removing errors:

This includes finding and correcting typos, inconsistencies in formatting, and outliers that deviate significantly from the norm.

Handling missing values:

Missing data points can be dealt with by imputation (filling in missing values), deletion, or other techniques depending on the context.

Formatting inconsistencies:

Ensuring consistent formatting across data points, such as date formats, units of measurement, and capitalization, is crucial.

Detecting and removing duplicates:

Duplicate entries can skew analysis, so identifying and removing them is essential.

Standardizing data:

Transforming data into a consistent format, like scaling numerical values or converting categorical data into numerical codes, facilitates analysis.
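The standardizing step is the only one in this list that the notebook does not demonstrate later, so here is a minimal sketch of it. It assumes min-max scaling for the numeric column and pandas category codes for the text column; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Category": ["Pharma", "BPO", "Pharma", "EdTech"],
    "Ratings": [2.6, 4.6, 3.7, 4.5],
})

# Min-max scaling maps the column onto the range [0, 1].
r = toy["Ratings"]
toy["Ratings_scaled"] = (r - r.min()) / (r.max() - r.min())

# Convert the text labels to integer codes; pandas assigns the codes
# in sorted label order (BPO, EdTech, Pharma).
toy["Category_code"] = toy["Category"].astype("category").cat.codes
print(toy)
```

Integer codes imply an ordering that the labels do not have, so for many models one-hot encoding (`pd.get_dummies`) may be the safer choice.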

Improved data quality:

Cleaning leads to more reliable and trustworthy data, enhancing the credibility of your analysis.

Enhanced analysis accuracy:

Clean data ensures your analysis reflects the true underlying patterns and relationships within the data.

Efficient data manipulation:

Clean data allows for smoother and faster manipulation and transformation during analysis.

Better decision-making:

Ultimately, clean data empowers you to make informed and effective decisions based on accurate insights.

Tools and techniques for data cleaning:

Programming languages:

Python with libraries like Pandas and NumPy is popular for data cleaning tasks.

Spreadsheets:

While suitable for smaller datasets, tools like Microsoft Excel can be used for basic cleaning tasks.

Data cleaning software:

Specialized software offers advanced features and automation for complex cleaning tasks.

HANDLING MISSING VALUES:-

In [65]:

df2=df.fillna(value=0)
df2

Out[65]:

      Company_name                   Description                                        Ratings  Highly_rated_for                                   Critically_rated_for                       Total_reviews  Avg_salary  Interviews_taken  Total_jobs_available  Total_benefits
0     TCS                            IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security, Work Life Balance                    3.8                                        73.1k          856.9k      6.1k              847                   11.5k
1     Accenture                      IT Services & Consulting | 1 Lakh+ Employees |...  4.0      Company Culture, Skill Development / Learning,...  4.0                                        46.4k          584.6k      4.3k              9.9k                  7.1k
2     Cognizant                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Skill Development / Learning                       3.9                                        41.7k          561.5k      3.6k              460                   5.8k
3     Wipro                          IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security                                       3.8                                        39.2k          427.4k      3.7k              405                   5k
4     Capgemini                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Job Security, Work Life Balance, Skill Develop...  3.9                                        34k            414.4k      2.8k              719                   4k
...   ...                            ...                                                ...      ...                                                ...                                        ...            ...         ...               ...                   ...
9995  Techila Global Services        IT Services & Consulting | 501-1k Employees | ...  3.7      Work Life Balance, Salary & Benefits, Company ...  3.7                                        72             454         2                 26                    21
9996  RxLogix Corporation            Pharma | 201-500 Employees | 14 years old | Pr...  2.6      Work Life Balance, Work Satisfaction, Company ...  2.6                                        72             799         15                9                     13
9997  Avians Innovations Technology  Building Material | 51-200 Employees | 17 year...  3.7      Promotions / Appraisal, Work Satisfaction, Sal...  3.7                                        72             489         3                 11                    8
9998  ACPL Systems                   Law Enforcement & Security | 51-200 Employees ...  3.3      Promotions / Appraisal, Salary & Benefits, Wor...  3.3                                        72             520         4                 1                     10
9999  Beroe Inc                      Management Consulting | 201-500 Employees | 19...  4.5      Work Life Balance, Job Security, Company Culture   4.5                                        72             585         7                 5                     14

10000 rows × 10 columns

In [66]:

df3=df.fillna({'Critically_rated_for':'NaN','Highly_rated_for':'NaN'})
df3

Out[66]:

      Company_name                   Description                                        Ratings  Highly_rated_for                                   Critically_rated_for                       Total_reviews  Avg_salary  Interviews_taken  Total_jobs_available  Total_benefits
0     TCS                            IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security, Work Life Balance                    3.8                                        73.1k          856.9k      6.1k              847                   11.5k
1     Accenture                      IT Services & Consulting | 1 Lakh+ Employees |...  4.0      Company Culture, Skill Development / Learning,...  4.0                                        46.4k          584.6k      4.3k              9.9k                  7.1k
2     Cognizant                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Skill Development / Learning                       3.9                                        41.7k          561.5k      3.6k              460                   5.8k
3     Wipro                          IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security                                       3.8                                        39.2k          427.4k      3.7k              405                   5k
4     Capgemini                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Job Security, Work Life Balance, Skill Develop...  3.9                                        34k            414.4k      2.8k              719                   4k
...   ...                            ...                                                ...      ...                                                ...                                        ...            ...         ...               ...                   ...
9995  Techila Global Services        IT Services & Consulting | 501-1k Employees | ...  3.7      Work Life Balance, Salary & Benefits, Company ...  3.7                                        72             454         2                 26                    21
9996  RxLogix Corporation            Pharma | 201-500 Employees | 14 years old | Pr...  2.6      Work Life Balance, Work Satisfaction, Company ...  2.6                                        72             799         15                9                     13
9997  Avians Innovations Technology  Building Material | 51-200 Employees | 17 year...  3.7      Promotions / Appraisal, Work Satisfaction, Sal...  3.7                                        72             489         3                 11                    8
9998  ACPL Systems                   Law Enforcement & Security | 51-200 Employees ...  3.3      Promotions / Appraisal, Salary & Benefits, Wor...  3.3                                        72             520         4                 1                     10
9999  Beroe Inc                      Management Consulting | 201-500 Employees | 19...  4.5      Work Life Balance, Job Security, Company Culture   4.5                                        72             585         7                 5                     14

10000 rows × 10 columns

In [64]:

df3.isnull().sum()

Out[64]:

Company_name            0
Description             0
Ratings                 0
Highly_rated_for        0
Critically_rated_for    0
Total_reviews           0
Avg_salary              0
Interviews_taken        0
Total_jobs_available    0
Total_benefits          0
dtype: int64
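Filling with the literal string `'NaN'` makes `isnull()` report zero, but the text is easy to mistake for a real missing value when the table is displayed. A possible alternative is a descriptive placeholder. The label `"Not specified"` and the frame `toy` below are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Highly_rated_for": ["Job Security", np.nan],
    "Critically_rated_for": [np.nan, "Promotions / Appraisal"],
})

# "Not specified" is an assumed label; any clearly non-data string would do.
filled = toy.fillna({
    "Highly_rated_for": "Not specified",
    "Critically_rated_for": "Not specified",
})

print(filled.isnull().sum().sum())          # remaining missing values
print((filled == "Not specified").sum())    # placeholders per column
```

Counting the placeholder afterwards keeps the amount of originally missing data visible.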

Imputation:

In [78]:

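from sklearn.impute import SimpleImputer

# 'Critically_rated_for' holds text, so strategy='mean' would raise an error;
# 'most_frequent' works for both text and numeric columns.
imputer = SimpleImputer(strategy='most_frequent')
# fit_transform returns a 2-D array, so flatten it before assigning to one column.
df['Critically_rated_for'] = imputer.fit_transform(df[['Critically_rated_for']]).ravel()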
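Which `SimpleImputer` strategy applies depends on the column type: the mean is only defined for numbers, while the most frequent value also works for text. The sketch below shows both on a made-up frame `toy`; the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Ratings": [3.8, np.nan, 4.0, 4.2],
    "Critically_rated_for": ["Promotions / Appraisal", np.nan,
                             "Promotions / Appraisal", "Salary & Benefits"],
})

# Numeric column: replace NaN with the mean of the observed values.
num_imputer = SimpleImputer(strategy="mean")
toy["Ratings"] = num_imputer.fit_transform(toy[["Ratings"]]).ravel()

# Text column: replace NaN with the most frequent label.
cat_imputer = SimpleImputer(strategy="most_frequent")
toy["Critically_rated_for"] = cat_imputer.fit_transform(
    toy[["Critically_rated_for"]]
).ravel()

print(toy)
```

With roughly 72% of `Critically_rated_for` missing in the real data, most-frequent imputation would make one label dominate the column, so a placeholder label may be the more honest option there.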

Duplicates:-

In [29]:

df.duplicated()

Out[29]:

0       False
1       False
2       False
3       False
4       False
        ...  
9995     True
9996     True
9997     True
9998    False
9999    False
Length: 10000, dtype: bool

In [68]:

df.duplicated().sum()

Out[68]:

641
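`duplicated()` only flags repeated rows; removing them takes `drop_duplicates()`. Applied to `df`, this would drop the 641 flagged rows and leave 9,359. The sketch below shows the pattern on a made-up frame `toy`.

```python
import pandas as pd

# Toy frame with one exact duplicate row (illustrative values only).
toy = pd.DataFrame({
    "Company_name": ["TCS", "Wipro", "TCS", "Beroe Inc"],
    "Ratings": [3.8, 3.8, 3.8, 4.5],
})

# Number of rows that repeat an earlier row.
print(toy.duplicated().sum())

# keep="first" retains the first occurrence of each repeated row;
# resetting the index closes the gaps left by the removed rows.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(deduped)
```

Passing `subset=["Company_name"]` would instead treat rows as duplicates whenever the name repeats, even if other columns differ.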

Filling Missing Values with False:-

In [39]:

df4=df.fillna(value=False)
df4

Out[39]:

      Company_name                   Description                                        Ratings  Highly_rated_for                                   Critically_rated_for                       Total_reviews  Avg_salary  Interviews_taken  Total_jobs_available  Total_benefits
0     TCS                            IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security, Work Life Balance                    3.8                                        73.1k          856.9k      6.1k              847                   11.5k
1     Accenture                      IT Services & Consulting | 1 Lakh+ Employees |...  4.0      Company Culture, Skill Development / Learning,...  4.0                                        46.4k          584.6k      4.3k              9.9k                  7.1k
2     Cognizant                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Skill Development / Learning                       3.9                                        41.7k          561.5k      3.6k              460                   5.8k
3     Wipro                          IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security                                       3.8                                        39.2k          427.4k      3.7k              405                   5k
4     Capgemini                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Job Security, Work Life Balance, Skill Develop...  3.9                                        34k            414.4k      2.8k              719                   4k
...   ...                            ...                                                ...      ...                                                ...                                        ...            ...         ...               ...                   ...
9995  Techila Global Services        IT Services & Consulting | 501-1k Employees | ...  3.7      Work Life Balance, Salary & Benefits, Company ...  3.7                                        72             454         2                 26                    21
9996  RxLogix Corporation            Pharma | 201-500 Employees | 14 years old | Pr...  2.6      Work Life Balance, Work Satisfaction, Company ...  2.6                                        72             799         15                9                     13
9997  Avians Innovations Technology  Building Material | 51-200 Employees | 17 year...  3.7      Promotions / Appraisal, Work Satisfaction, Sal...  3.7                                        72             489         3                 11                    8
9998  ACPL Systems                   Law Enforcement & Security | 51-200 Employees ...  3.3      Promotions / Appraisal, Salary & Benefits, Wor...  3.3                                        72             520         4                 1                     10
9999  Beroe Inc                      Management Consulting | 201-500 Employees | 19...  4.5      Work Life Balance, Job Security, Company Culture   4.5                                        72             585         7                 5                     14

10000 rows × 10 columns

In [57]:

df4.isnull().sum()

Out[57]:

0

In [42]:

df4.columns

Out[42]:

Index(['Company_name', 'Description', 'Ratings', 'Highly_rated_for',
       'Critically_rated_for', 'Total_reviews', 'Avg_salary',
       'Interviews_taken', 'Total_jobs_available', 'Total_benefits'],
      dtype='object')

Formatting Inconsistencies:

In [50]:

df4=df['Company_name'].str.upper()
df4

Out[50]:

0                                 TCS
1                           ACCENTURE
2                           COGNIZANT
3                               WIPRO
4                           CAPGEMINI
                    ...              
9995          TECHILA GLOBAL SERVICES
9996              RXLOGIX CORPORATION
9997    AVIANS INNOVATIONS TECHNOLOGY
9998                     ACPL SYSTEMS
9999                        BEROE INC
Name: Company_name, Length: 10000, dtype: object

In [61]:

df4.isnull()

Out[61]:

0       False
1       False
2       False
3       False
4       False
        ...  
9995    False
9996    False
9997    False
9998    False
9999    False
Name: Company_name, Length: 10000, dtype: bool
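Upper-casing handles differences in letter case only. Names that differ in surrounding or repeated whitespace would still count as distinct, so a slightly fuller normalization could look like the sketch below. The series `names` holds made-up variants for illustration.

```python
import pandas as pd

# Made-up name variants that should all collapse to two companies.
names = pd.Series(["  TCS", "tcs ", "Beroe  Inc", "BEROE INC"])

# Trim the ends, collapse runs of whitespace to one space, then unify the case.
cleaned = (
    names.str.strip()
         .str.replace(r"\s+", " ", regex=True)
         .str.upper()
)

print(cleaned.tolist())
print(cleaned.nunique())
```

Running `nunique()` before and after such a step is a quick check of how many spelling variants were merged.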

Renaming Columns:-

In [74]:

df5 = df.rename(columns={'Company_name': 'Companies_name', 'Highly_rated_for': 'High_rated','Total_jobs_available':'Total_jobs','Total_benefits':'Total_benefited'})
df5

Out[74]:

      Companies_name                 Description                                        Ratings  High_rated                                         Critically_rated_for                       Total_reviews  Avg_salary  Interviews_taken  Total_jobs            Total_benefited
0     TCS                            IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security, Work Life Balance                    3.8                                        73.1k          856.9k      6.1k              847                   11.5k
1     Accenture                      IT Services & Consulting | 1 Lakh+ Employees |...  4.0      Company Culture, Skill Development / Learning,...  4.0                                        46.4k          584.6k      4.3k              9.9k                  7.1k
2     Cognizant                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Skill Development / Learning                       3.9                                        41.7k          561.5k      3.6k              460                   5.8k
3     Wipro                          IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security                                       3.8                                        39.2k          427.4k      3.7k              405                   5k
4     Capgemini                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Job Security, Work Life Balance, Skill Develop...  3.9                                        34k            414.4k      2.8k              719                   4k
...   ...                            ...                                                ...      ...                                                ...                                        ...            ...         ...               ...                   ...
9995  Techila Global Services        IT Services & Consulting | 501-1k Employees | ...  3.7      Work Life Balance, Salary & Benefits, Company ...  3.7                                        72             454         2                 26                    21
9996  RxLogix Corporation            Pharma | 201-500 Employees | 14 years old | Pr...  2.6      Work Life Balance, Work Satisfaction, Company ...  2.6                                        72             799         15                9                     13
9997  Avians Innovations Technology  Building Material | 51-200 Employees | 17 year...  3.7      Promotions / Appraisal, Work Satisfaction, Sal...  3.7                                        72             489         3                 11                    8
9998  ACPL Systems                   Law Enforcement & Security | 51-200 Employees ...  3.3      Promotions / Appraisal, Salary & Benefits, Wor...  3.3                                        72             520         4                 1                     10
9999  Beroe Inc                      Management Consulting | 201-500 Employees | 19...  4.5      Work Life Balance, Job Security, Company Culture   4.5                                        72             585         7                 5                     14

10000 rows × 10 columns

Unique Values:

In [76]:

df.nunique()

Out[76]:

Companies_name          9355
Description             9330
Ratings                   34
Highly_rated_for         253
Critically_rated_for      34
Total_reviews            889
Avg_salary              1229
Interviews_taken         306
Total_jobs_available     309
Total_benefits           471
dtype: int64

Creating a new column:

In [83]:

df['Category'] = df['Description'].str.split('|').str[0].str.strip()
df

Out[83]:

      Companies_name                 Description                                        Ratings  Highly_rated_for                                   Critically_rated_for                       Total_reviews  Avg_salary  Interviews_taken  Total_jobs_available  Total_benefits  Category
0     TCS                            IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security, Work Life Balance                    3.8                                        73.1k          856.9k      6.1k              847                   11.5k           IT Services & Consulting
1     Accenture                      IT Services & Consulting | 1 Lakh+ Employees |...  4.0      Company Culture, Skill Development / Learning,...  4.0                                        46.4k          584.6k      4.3k              9.9k                  7.1k            IT Services & Consulting
2     Cognizant                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Skill Development / Learning                       3.9                                        41.7k          561.5k      3.6k              460                   5.8k            IT Services & Consulting
3     Wipro                          IT Services & Consulting | 1 Lakh+ Employees |...  3.8      Job Security                                       3.8                                        39.2k          427.4k      3.7k              405                   5k              IT Services & Consulting
4     Capgemini                      IT Services & Consulting | 1 Lakh+ Employees |...  3.9      Job Security, Work Life Balance, Skill Develop...  3.9                                        34k            414.4k      2.8k              719                   4k              IT Services & Consulting
...   ...                            ...                                                ...      ...                                                ...                                        ...            ...         ...               ...                   ...             ...
9995  Techila Global Services        IT Services & Consulting | 501-1k Employees | ...  3.7      Work Life Balance, Salary & Benefits, Company ...  3.7                                        72             454         2                 26                    21              IT Services & Consulting
9996  RxLogix Corporation            Pharma | 201-500 Employees | 14 years old | Pr...  2.6      Work Life Balance, Work Satisfaction, Company ...  2.6                                        72             799         15                9                     13              Pharma
9997  Avians Innovations Technology  Building Material | 51-200 Employees | 17 year...  3.7      Promotions / Appraisal, Work Satisfaction, Sal...  3.7                                        72             489         3                 11                    8               Building Material
9998  ACPL Systems                   Law Enforcement & Security | 51-200 Employees ...  3.3      Promotions / Appraisal, Salary & Benefits, Wor...  3.3                                        72             520         4                 1                     10              Law Enforcement & Security
9999  Beroe Inc                      Management Consulting | 201-500 Employees | 19...  4.5      Work Life Balance, Job Security, Company Culture   4.5                                        72             585         7                 5                     14              Management Consulting

10000 rows × 11 columns
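Taking the first `|`-separated segment assumes every description starts with an industry. A later output in this notebook shows a `Category` of `Pune +30 more`, which looks like a location list rather than an industry. The sketch below shows one possible check. The regular expression, the sample strings in `desc`, and the decision to blank out suspect values are all assumptions made for illustration.

```python
import pandas as pd

# Made-up descriptions: one starts with an industry, one with a location list.
desc = pd.Series([
    "Pharma | 201-500 Employees | 14 years old",
    "Pune +30 more",
])

category = desc.str.split("|").str[0].str.strip()

# Assumption: a first segment ending in "+N more" is a location list, not an industry.
looks_like_location = category.str.contains(r"\+\d+ more$", regex=True)

# Replace the suspect values with NaN so they can be handled as missing data.
category = category.mask(looks_like_location)
print(category)
```

Inspecting `df['Category'].value_counts()` would show whether other non-industry patterns occur in the real data.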

Exploratory Data Analysis

EDA stands for Exploratory Data Analysis. It's an approach used to analyze and investigate datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods. Here are some key points about EDA:

Main Goals:

Understanding the data: EDA helps you gain a deeper understanding of the data you're working with, including its structure, distribution, relationships between variables, and potential outliers.

Identifying patterns and trends: By exploring the data, you can discover hidden patterns, trends, and anomalies that might not be readily apparent from simply looking at the raw data.

Formulating hypotheses: Based on your observations during EDA, you can formulate hypotheses about the data that can be further tested through statistical modeling or other methods.

Key Techniques:

Data visualization: Creating histograms, scatter plots, boxplots, and other visualizations helps you see the distribution of data, identify outliers, and understand relationships between variables.

Descriptive statistics: Calculating summary statistics like mean, median, standard deviation, and quartiles helps you quantify the central tendency and spread of the data.

Data cleaning: Identifying and handling missing values, outliers, and inconsistencies in the data is crucial for reliable analysis.

Benefits of EDA:

Improved understanding of data: EDA provides a foundation for further analysis and modeling.

Identification of potential issues: EDA helps you spot data quality problems and potential biases.

Generation of insights and hypotheses: EDA can lead to the discovery of interesting patterns.
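The descriptive-statistics and outlier ideas above can be tried without any plotting. The sketch below uses Tukey's 1.5 × IQR rule, which is one common convention and not something this notebook prescribes. The series `ratings` contains made-up values.

```python
import pandas as pd

# Made-up ratings with one unusually low value.
ratings = pd.Series([3.7, 3.8, 3.9, 3.9, 4.0, 4.1, 1.3])

# Summary statistics: count, mean, std, min, quartiles, max.
print(ratings.describe())

# Interquartile range (spread of the middle 50% of the values).
q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
outliers = ratings[(ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)]
print(outliers)
```

A flagged value is a prompt to look at the row, not proof of an error: a rating of 1.3 can be perfectly genuine.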

In [94]:

sns.pairplot(df)
plt.show()

Computing Correlation Matrix:

In [95]:

numeric_columns = ['Total_reviews', 'Avg_salary', 'Interviews_taken', 'Total_jobs_available', 'Total_benefits']
# Note: errors='coerce' turns every value that is not a plain number
# (for example '73.1k') into NaN instead of raising an error.
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')
# Re-select all numeric columns so that Ratings is included as well.
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
correlation_matrix = df[numeric_columns].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix')
plt.show()

In [97]:

null_count = df['Avg_salary'].isnull().sum()
print(f"\nNumber of null values in 'Avg_salary': {null_count}")
 
Number of null values in 'Avg_salary': 3856
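Many of these nulls are likely created by the conversion itself: `pd.to_numeric(..., errors='coerce')` cannot parse values such as `856.9k`, so they become NaN. A sketch of a parser that keeps them is shown below. The helper name `parse_k_suffix` is hypothetical, and it assumes that a trailing `k` means thousands and that no other suffixes occur.

```python
import pandas as pd

def parse_k_suffix(value):
    """Convert strings such as '73.1k' or '847' to floats (hypothetical helper).

    Assumes a trailing 'k' means thousands; other suffixes are not handled.
    """
    text = str(value).strip().lower()
    if text.endswith("k"):
        return float(text[:-1]) * 1_000
    return float(text)

# Sample values in the same style as the raw columns.
raw = pd.Series(["73.1k", "847", "5k", "72"])
parsed = raw.map(parse_k_suffix)
print(parsed.tolist())
```

Parsing instead of coercing would keep the large companies in the analysis; with coercion, only rows whose values were already plain numbers survive the later `dropna` calls.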

In [99]:

df.dropna(subset=['Avg_salary'], inplace=True)
print("\nDataFrame after dropping null values:")
df
 
DataFrame after dropping null values:

Out[99]:

      Companies_name                 Category                    Ratings  Total_reviews  Avg_salary  Interviews_taken  Total_jobs_available  Total_benefits
213   Marpu Foundation               Non-Profit                  4.9      NaN            13.0        24.0              NaN                   84.0
443   Taurus BPO Services            BPO                         4.6      NaN            647.0       14.0              1.0                   524.0
499   PHN Technology                 Pune +30 more               4.6      NaN            186.0       21.0              NaN                   25.0
583   Exotic Learning                EdTech                      4.5      NaN            368.0       41.0              14.0                  183.0
801   Karma Ayurveda                 Healthcare                  4.5      876.0          388.0       4.0               NaN                   33.0
...   ...                            ...                         ...      ...            ...         ...               ...                   ...
9995  Techila Global Services        IT Services & Consulting    3.7      72.0           454.0       2.0               26.0                  21.0
9996  RxLogix Corporation            Pharma                      2.6      72.0           799.0       15.0              9.0                   13.0
9997  Avians Innovations Technology  Building Material           3.7      72.0           489.0       3.0               11.0                  8.0
9998  ACPL Systems                   Law Enforcement & Security  3.3      72.0           520.0       4.0               1.0                   10.0
9999  Beroe Inc                      Management Consulting       4.5      72.0           585.0       7.0               5.0                   14.0

6144 rows × 8 columns

In [100]:

null_count = df['Total_benefits'].isnull().sum()
print(f"\nNumber of null values in 'Total_benefits': {null_count}")
 
Number of null values in 'Total_benefits': 66

In [102]:

df.dropna(subset=['Total_benefits'], inplace=True)
print("\nDataFrame after dropping null values:")
df
 
DataFrame after dropping null values:

Out[102]:

      Companies_name                 Category                    Ratings  Total_reviews  Avg_salary  Interviews_taken  Total_jobs_available  Total_benefits
213   Marpu Foundation               Non-Profit                  4.9      NaN            13.0        24.0              NaN                   84.0
443   Taurus BPO Services            BPO                         4.6      NaN            647.0       14.0              1.0                   524.0
499   PHN Technology                 Pune +30 more               4.6      NaN            186.0       21.0              NaN                   25.0
583   Exotic Learning                EdTech                      4.5      NaN            368.0       41.0              14.0                  183.0
801   Karma Ayurveda                 Healthcare                  4.5      876.0          388.0       4.0               NaN                   33.0
...   ...                            ...                         ...      ...            ...         ...               ...                   ...
9995  Techila Global Services        IT Services & Consulting    3.7      72.0           454.0       2.0               26.0                  21.0
9996  RxLogix Corporation            Pharma                      2.6      72.0           799.0       15.0              9.0                   13.0
9997  Avians Innovations Technology  Building Material           3.7      72.0           489.0       3.0               11.0                  8.0
9998  ACPL Systems                   Law Enforcement & Security  3.3      72.0           520.0       4.0               1.0                   10.0
9999  Beroe Inc                      Management Consulting       4.5      72.0           585.0       7.0               5.0                   14.0

6078 rows × 8 columns

Target and Features:

In [117]:

y = df.Ratings
df_features = ['Avg_salary', 'Total_benefits']
X = df[df_features]

In [118]:

X.describe()

Out[118]:

        Avg_salary  Total_benefits
count  6078.000000     6078.000000
mean    548.676538       16.665021
std     224.526073       17.807252
min       2.000000        1.000000
25%     388.000000        9.000000
50%     538.000000       13.000000
75%     718.000000       19.000000
max     999.000000      524.000000

In [119]:

X.head()

Out[119]:

     Avg_salary  Total_benefits
213        13.0            84.0
443       647.0           524.0
499       186.0            25.0
583       368.0           183.0
801       388.0            33.0

Resetting Index:

In [121]:

# X is a selection from df, so reassign instead of resetting in place.
X = X.reset_index(drop=True)
print("\nDataFrame after resetting the index:")
X
 
DataFrame after resetting the index:

Out[121]:

      Avg_salary  Total_benefits
0           13.0            84.0
1          647.0           524.0
2          186.0            25.0
3          368.0           183.0
4          388.0            33.0
...          ...             ...
6073       454.0            21.0
6074       799.0            13.0
6075       489.0             8.0
6076       520.0            10.0
6077       585.0            14.0

6078 rows × 2 columns

In [122]:

y.head(10)

Out[122]:

213    4.9
443    4.6
499    4.6
583    4.5
801    4.5
839    4.3
866    4.6
868    4.7
895    4.5
900    4.9
Name: Ratings, dtype: float64

In [124]:

# y is a Series taken from df, so reassign instead of resetting in place.
y = y.reset_index(drop=True)
print("\nSeries after resetting the index:")
y
 
Series after resetting the index:

Out[124]:

0       4.9
1       4.6
2       4.6
3       4.5
4       4.5
       ... 
6073    3.7
6074    2.6
6075    3.7
6076    3.3
6077    4.5
Name: Ratings, Length: 6078, dtype: float64

In [126]:

from sklearn.tree import DecisionTreeRegressor
P_model = DecisionTreeRegressor(random_state=1)
P_model.fit(X, y)

Out[126]:

DecisionTreeRegressor(random_state=1)


In [127]:

print("Making predictions for the following 5 Company Ratings:")
print(X.head())
print("The predictions are")
print(P_model.predict(X.head()))
Making predictions for the following 5 Company Ratings:
   Avg_salary  Total_benefits
0        13.0            84.0
1       647.0           524.0
2       186.0            25.0
3       368.0           183.0
4       388.0            33.0
The predictions are
[4.9 4.6 4.6 4.5 4.5]
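The predictions above match the true ratings exactly because the model is asked about rows it was trained on, which an unrestricted decision tree can memorize. A fairer check holds some rows back. The sketch below uses synthetic random data in place of `X` and `y` (the names `X_demo` and `y_demo` are made up), so its error values say nothing about the real dataset; only the pattern is the point.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y (random values, not the real data).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Avg_salary": rng.uniform(2, 999, size=200),
    "Total_benefits": rng.uniform(1, 100, size=200),
})
y_demo = pd.Series(rng.uniform(1.3, 5.0, size=200), name="Ratings")

# Hold back 25% of the rows (the default) for validation.
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=1)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Compare the error on rows the model has seen with rows it has not.
train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(f"MAE on training rows: {train_mae:.3f}")
print(f"MAE on held-out rows: {val_mae:.3f}")
```

In the notebook, the same pattern would pass `X` and `y` to `train_test_split`; limiting tree size with `max_leaf_nodes` or `max_depth` is a common next step when the held-out error is much larger than the training error.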

In [128]:

df.boxplot(figsize=(20,10))

Out[128]:

<Axes: >

In [129]:

df3.plot()

Out[129]:

<Axes: >

In [135]:

plt.scatter(x=df['Avg_salary'],y=df['Total_benefits'],color='blue')
plt.xticks(rotation=70)
plt.xlabel('Avg_salary')
plt.ylabel('Total_benefits')
plt.show()

In [146]:

plt.hist(df['Ratings'],color='orange',bins=50)
plt.show()

In [148]:

plt.scatter(x='Companies_name',y='Ratings',data=df,c='g',s=100)

Out[148]:

<matplotlib.collections.PathCollection at 0x27c3223ce50>

In [149]:

plt.scatter(x='Companies_name',y='Avg_salary',data=df,c='g',s=100)

Out[149]:

<matplotlib.collections.PathCollection at 0x27c31627050>

Conclusion:

In [ ]: